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Abstract 

Recent developments in system identification have bronght attention 
to regnlarized kernel-based methods, where, adopting the recently intro- 
dnced stable spline kernel, prior information on the nnknown process 
is enforced. This reduces the variance of the estimates and thus makes 
kernel-based methods particularly attractive when few input-output data 
samples are available. In such cases however, the influence of the system 
initial conditions may have a signihcant impact on the output dynamics. 
In this paper, we specifically address this point. We propose three methods 
that deal with the estimation of initial conditions using different types of 
information. The methods consist in various mixed maximum likelihood-a 
posteriori estimators which estimate the initial conditions and tune the 
hyperparameters characterizing the stable spline kernel. To solve the 
related optimization problems, we resort to the expectation-maximization 
method, showing that the solutions can be attained by iterating among 
simple update steps. Numerical experiments show the advantages, in terms 
of accuracy in reconstructing the system impulse response, of the proposed 
strategies, compared to other kernel-based schemes not accounting for the 
effect initial conditions. 


1 Introduction 

Regularized regression has a long history [^. It has become a standard tool in 
applied statistics [^, mainly due to its capability of reducing the mean square 
error (MSE) of the regressor estimate |^, when compared to standard least 
squares |^. Recently, a novel method based on regularization has been proposed 
for system identification . In this approach, the goal is to get an estimate of 
the impulse response of the system, using the so called kernel-based methods . 
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risuleoQkth. se , bottegal@kth. se , hjalmarsQkth. se). This work was supported by the 
European Research Council under the advanced grant LEARN, contract 267381 and by the 
Swedish Research Council under contract 621-2009-4017. 
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To this end, the class of stable spline kernels has been proposed recently in 
The main feature of these kernels is that they encode prior information on the 
exponential stability of the system and on the smoothness of the impulse response. 
These features have made stable spline kernels suitable for other estimation 
problems, such as the reconstruction of exponential decays and correlation 


functions 10 . Other kernels for system identification have been introduced in 


subsequent studies, see for instance 11 , 12 


Stable spline kernels are parameterized by two hyperparameters, that deter¬ 
mine magnitude and shape of the kernel and that need to be estimated from 
data. An effective approach for hyperparameter estimation relies upon empirical 
Bayes arguments 13 . Specifically, exploiting the Bayesian interpretation of reg¬ 


ularization 14 , the impulse response is modeled as the realization of a Gaussian 
process whose covariance matrix corresponds to the kernel. The hyperparameters 
are then estimated by maximizing the marginal likelihood of the output data, 
obtained by integrating out the dependence on the impulse response. Given a 
choice of hyperparameters, the unknown impulse response is found by computing 
its minimum MSE Bayesian estimate [^. 

One situation where kernel-based methods are preferable is when data records 
are short (e.g., five times the rise time of the system). This mainly because of 
two reasons: 


1. Kernel-based methods do not require the selection of a model order. Stan¬ 
dard parametric techniques (such as the prediction error method |^, [|15| ) 
need to rely on model selection criteria, such as AIG or BIG, if the structure 
of the system is unknown [16| . These could be unreliable when faced with 
small data sets. 

2. The bias introduced by regularization reduces the variance. With small 
data records, the variance can be very high. If the bias is of the right kind, 
it will compensate for the variance effect in the MSE Ch. 2.9]. 

When data records are very short (e.g., two times the rise time of the system) 
we cannot ignore the effect of the initial conditions. In fact, if the system is 
not at rest before the experiment is performed, then there are transient effects 
that cannot be explained using only the collected data. Standard workarounds, 
such as discarding those output samples that depend on the initial conditions or 
approximating the initial conditions to zero Gh. 10.1], may give unsatisfactory 
results. Thus, it seems preferable to deal with the initial conditions by estimating 
them. In this paper we discuss how to incorporate the estimation of the initial 
conditions in the context of kernel-based system identification. We discuss three 
possible approaches to the problem. First, we propose a method that incorporates 
the unknown initial conditions as parameters, to be estimated along with the 
kernel hyperparameters. Then, assuming that the input is an autoregressive- 
moving-average (ARMA) stationary process, we propose to estimate the initial 
conditions using the available samples of the input, thus designing a minimum 
variance estimate of the initial conditions from the input samples. Finally, we 
design a mixed maximum a posteriori-marginal likelihood (MAP-ML) estimator 
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(see 17 ) that effectively exploits information from both input and output data. 
We solve the optimization problems using novel iterative schemes based on the 
expectation-maximization (EM) method 18 , similar to the technique used in 
our previous works 19 and 120], where methods for Hammerstein and blind 


system identification are proposed. We show that each iteration consists of a set 
of simple update rules which either are available in closed-form or involve scalar 
optimization problems, that can be solved using a computationally efficient grid 
search. 

The paper is organized as follows. In Section]^ we formulate the problem 
of system identification with uncertainty on the initial conditions. In Section 3 
we provide a short review of kernel-based system identification. In Section 4 
we propose the initial-conditions estimation strategies and the related system 
identification algorithms. In Section we show the results of numerical ex¬ 
periments. In these experiments, the discussed method are compared with 
standard techniques used to deal with unknown initial conditions. In Section]^ 
we summarize the work and conclude the paper. 


2 Problem formulation 

We consider the output error model of the form 

OO 

yt = ^9kUt-k + vt, ( 1 ) 


where is the impulse response of a linear time-invariant system. For 

notational convenience, we assume there are no delays in the system (go 0). 
We approximate g by considering its first n samples {gt}"^o ’ ""^here n is chosen 
large enough to capture the system dynamics. The system is driven by the input 
Ut and the measurements of the output yt are corrupted by the process Vt, which 
is zero-mean white Gaussian noise with variance cr^. 

Given a set of N measurements, denoted by ^.re 

interested in estimating the first n samples of the impulse response ■ To 

this end, we formulate this system identification problem as the linear regression 
problem 

y=Ug + v, (2) 

where we have introduced the following vector/matrix notation 


yo 


go 


Vo 


, g ■= 


, V := 


1 

1_ 




1 

1 

1_ 
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Uo 

U_1 

U-2 



Ui 

Uo 

U-1 

U-n-\-2 

u = 

U2 

Ui 

Uo 

U-n-\-3 


UN-1 

UN-2 

UN-3 

'^N—n 


The matrix U contains the samples u_i,..., U-n+i, that we call initial conditions, 
that are unavailable. Common ways to overcome this problem are, for instance 

• Discard the first n — 1 collected samples of y. However, if N is not much 
larger than n, (e.g., if n ^ 100 and N ^ 200), there is a considerable loss 
of information. 

• Assume that the system is at rest before the experiment is performed, 
(i.e. U-i, ..., U-n+i = 0). This assumption might be too restrictive or 
unrealistic. 

In this paper, our aim is to study how to exploit the available information to 
estimate the initial conditions, in order to improve the identification performance. 
Specifically, we will present three estimators that make different use of the 
available information. 


3 Kernel-based system identification 


In this section we briefly review the kernel-based approach introduced in 0. i- 
Exploiting the Bayesian interpretation of kernel-based methods 14 , we model 


the unknown impulse response as a Gaussian random process, namely 


g^N{t),XKp). (4) 

We parameterize the covariance matrix Kp (the kernel) with the hyperparameter 
p. The structure of the kernel determines the properties of the realizations 
of its choice is therefore of paramount importance. An effective kernel for 
system-identification purposes is the stable spline kernel [^, [21| . In particular, 
in this paper we use the first-order stable spline kernel (or TC kernel in [Ilj ), 
that is defined as 

:= , (5) 

where /3 is a scalar in the interval [0, 1). The role of this hyperparameter is to 
regulate the velocity of the exponential decay of the impulse responses drawn 
from the kernel. The hyperparameter A > 0 is a scaling factor that regulates the 
amplitude of the realizations of Q . 

We collect the hyperparameters into the vector 

p-.= [\ P] (6) 
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and introduce the following notation: 






Uo 

U- 






U- := 

^-1 

u+ := 

_Un-1_ 


where U- contains the unknown initial conditions. Since we have assumed a 
Gaussian distribution for the noise, the joint description of y and g is Gaussian, 
parameterized by U- and p. Therefore, we can write 


P 






(7) 


where = Sjg = XUKp and Eg = XUKpU^ + It follows that the 
posterior distribution of g given y is Gaussian, namely 


P{9\y\ P, U-) =Af {g, Eg|g) , 


( 8 ) 


where 

'^a\v~ ^ J > 9 — ^g\y~^y- (9) 

Equation ( [^ i mplies that the minimum variance estimator of g (in the Bayesian 
sense, see |^) is 

9 = E[5|y; P, U-]. (10) 


The estimate g depends on the hyperparameter vector p and the initial conditions. 
These quantities need to be estimated from data. In the next section we focus 
our attention to the estimation of the kernel hyperparameters and the initial 
conditions, describing different strategies to obtain these quantities. 


Remark 1. The estimator (10 1 depends also on the noise variance . In this 
work, we assume that this parameter is known. It can for instance be estimated 
by fitting a least-squares estimate of the system g and then computing the sample 
variance of the residuals. □ 


4 Estimation of initial conditions and hyper pa¬ 
rameters 

In most works on kernel-based system identification (see e.g. for a survey), 
the authors adopt an empirical-Bayes approach to estimate the hyperparameters 
that define the kernel. This amounts to maximizing the marginal likelihood 
(ML) of the output, found integrating g out of ([^. 

In the standard case, that is when u- is assumed to be known, the ML 
estimator of the hyperparameters corresponds to 

p = argmaxp(?/;p, u_). (11) 

P 

We start from © to design new estimators for the initial conditions and the 
kernel hyperparameters. 
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4.1 Model-less estimate 


The most straightforward generalization of (11) is to include the initial conditions 
among the ML parameters. The initial conditions become unknown quantities 
that parameterize the impulse response estimator. The ML criterion then 
becomes 

p, u- = a.Ygmaxp{y; p, U-) , (12) 

Pi 

where the maximization is carried out over the unknown initial conditions as well. 
This problem is nonconvex and possibly high dimensional, as the number of initial 
conditions to be estimated is equal to the number of impulse response samples. 
To overcome this difhculty, we devise a strategy based on the expectation- 
maximization method that yields a solution to (121 by iterating simple updates. 
To this end, we suppose we have any estimate of the unknown quantities 
and u_', and we calculate the current estimate of the impulse response, as well 
as its variance, from Define the matrix 

(13) 


which is the second moment of the current estimated impulse response. We 
introduce the discrete-derivator matrix 


and we calculate the second moment of the derivative of the estimated impulse 
response at iteration k, Ag^^\ given by 

£)(fc) ^ (15) 


The Toeplitz matrix of the input samples U can be split in two parts, namely 
U = U+ U-, where U+ is fully determined by the available samples, and U- 
is composed of the unknown initial conditions. Define the matrix R G 
that satisfies the relation Ru_ = vec(?7_); and call the Toeplitz matrix of 
the estimated impulse response at the fcth iteration. Furthermore, define 


i«:=R'r S(^) 




R, 


^ :=vec(17+)^ 0 IatI R - 


(16) 


With all the definitions in place, we can state the theorem that provides us with 


the iterative update of the estimates that solves (12l. 


Theorem 1. Consider the hyperparameter estimator (12l. Starting from any 


initial guess of the initial conditions and the hyperparameters, compute 


^(fc+i) ^ ( 17 ) 

= arg min Q(/3) , (18) 

/3e[o,i) 

1 ^ 

= (19) 

• 1 
2 = 1 
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with and V'^'^ defined in (161, and 


Q(/3) := n log /(/3) + ^ log fi - log(l - fi), (20) 

n— 1 

/(/3) := ^ df r* + di^\l - ; (21) 


^(k) I I 

where d] ' is the ith diagonal element of (15), and j is the ith element of 


Wjs := 


/ 3-/32 


1 4 


1-/3 


( 22 ) 


when fi = Let then the sequences 

and converge to a maximum of (El- 

Proof. See Appendix. □ 


Remark 2. The EM method does not guarantee convergence of the sequences 
to a global maximum (see l2^ and J24^ ). However, experiments (see Section^ 
have show that, for this particular problem, the EM method converges to the 
global maximum independently of how it is initialized. 


Thus, we can use Theorem to find a maximum of the marginal likelihood of 
the hyperparameters and the initial conditions, and then use these parameters 
to solve the impulse response estimation problem with (10), where we use the 
limits of the sequences of estimates. 


4.2 Conditional mean estimate 

The model-less estimator presented in Section [4. 1 1 estimates the initial conditions 
using only information present in the system output y. It does not rely on any 
model of the input signal u. To show how an available model can be used to 
estimate the missing initial conditions, we introduce the following assumption. 


Assumption 1. The input ut is a realization of a stationary Gaussian process 
with zero-mean and known rational spectrum. Equivalently, Ut is a realization 
on an ARMA process with known coefficients. □ 


Assumption [T] implies that Ut can be expressed as the output of a difference 
equation driven by white Gaussian noise with unit variance 25 , namely 


Ut d\Ut—i -)-***-)- dpUt—p — cqC^ “)“***“)“ CpCt—p 


(23) 


where e* ^ Af{0, 1). Since, using Assumption]^ we can construct the probability 
density of the input process, a possible approach to solve (111 is to estimate the 
missing initial conditions from the input process. To this end, consider (23). If 
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we define the matrices D as the toeplitz matrix of the coefficients 0, di, c? 2 ,.. •, 
and C as the toeplitz matrix of the coefficients ci, C 2 , •.., then we can write 

u=—Du + Ce, e := [e_„+i ••• ew]^ , (24) 

so that p{u) ^ A/’(0, T,u), with 

= + (25) 


We thus have a joint probabilistic description of the initial conditions u_ and 
the available input samples u_|_. We can partition into four blocks according 
to the sizes of U- and 



It follows (see [^) that the posterior distribution of the unavailable data is 
p{u-\u+) = A/’(m_|+, S_|_|_), where 


u_|+ = S_, S_|+ = S_ — S__. (26) 


So we can find the minimum variance estimate of U- as the conditional mean 
u_|+, namely {(_ = m_| + . 

Having an estimate of the initial conditions, we need to find the hyperparam¬ 
eters that define the kernel. In this case, empirical Bayes amounts to solving the 
ML problem 

p = argmaxp(j/; p, u_), (27) 

p 

where the unknown initial conditions have been replaced by their conditional 
mean. The following theorem states how to solve the maximization using the 
EM method. 


Theorem 2. Consider the hyperparameter estimator (27l. Starting from an 
initial guess of the hyperparameters, compute 


p(k+l) _ ggjg Qijd) 


X(fc+1) = lyj; 

71 


j(^) „ 




(28) 

(29) 


withQ{l3), andw 0 (k-\-i) j defined in Theorem^ Let 


then the sequence converges to a maximum of (27l. 

Proof. See Appendix. 


□ 


Remark 3. The updates (281 and (29) require the evaluation of (15) at each 
iteration. In this case the estimate g'^ of the impulse response is given by (10), 
where U- is replaced by its conditional mean. □ 











4.3 Joint input-output estimate 

The conditional mean estimator presented in Section [4.2| exploits the structure 
of the input to estimate the missing samples. The model-less estimator, instead, 
uses information contained in the output samples. In this section we show how 
to merge these two information sources, defining a joint input-output estimator 
of the initial conditions. 

We use Assumption to account for the statistical properties of U-. We 
propose the following mixed MAP-ML estimator 


p, u- = argmaxp(?/1 M_, u+; p)p{u-\u+), (30) 

p, U — 


where we have highlighted the dependence on the known input sequence A 
key role is played by the term p{u- \ m+): it acts as a prior distribution for the 
unknown values of it_ and puts weight on the values that better agree with the 
observed data u+. 

Even in this case, the solution can be found with an iterative procedure based 
on the EM method. 


Theorem 3. Consider the hyperparameter estimator (30l. Starting from an 
initial guess of the initial conditions and the hyperparameters, compute 


'.(fe-Hi) _ 


V 


.-1, 


--fS 


-1 




-1+ 


-|-E liA— 


-|+«-K ’ 


^(fe-l-l) _ g^j-g ggjg Q(^) ^ 

/3e[o,i) 


1 


A('=+i) = - Vd: 


ik) 






(31) 

(32) 

(33) 


with and from and with Q{j3), and w^(k+i) , defined in 

Theorem 0 Let then, the seguences and 

converge to a maximum of ( [M| . 

Proof. See Appendix. □ 


Remark 4. This estimator incorporates the prior information about the initial 
conditions using the mean u._|+ and the covariance matrix E_|_|_. If we suppose 
that we can manipulate we can see this estimator as a more general 

estimator, that contains the model-less and conditional mean as limit cases. In 
fact, setting S-|+ = oo, we get the model-less estimator. Conversely, setting 
S_|_|_ = 0, we obtain the conditional-mean estimator, as (311 would yield a 
degenerate iteration where all the updates are = u_|+. We point however 

out that the model-less estimator does not rely on any assumption on the input 
model, whereas the joint input-output estimator requires that the input is a 
Gaussian process with known pdf. 
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5 Numerical experiments 


5.1 Experiment setup 


To compare the proposed methods, we perform five numerical experiments, each 
one consisting of 200 Monte Carlo simulations, with sample sizes 150, 200, 250, 
300, and 400. At each Monte Carlo run, we generate a dynamic system of order 
40. The system is such that the zeros are constrained within the circle of radius 
0.99 on the complex plane, while the magnitude of the poles is no larger than 
0.95. The impulse response length is 100 samples. The input is obtained by 
filtering white noise with unit variance through a 8-th order ARMA filter of 


the form (231. The coefficients of the filter are randomly chosen at each Monte 


Carlo run, and they are such that the poles of the filter are constrained within 
the circular region of radii 0.8 and 0.95 in the complex plane. 

Random trajectories of input and noise are generated at each run. In 
particular, the noise variance is such that the ratio between the variance of the 
noiseless output of the system and the noise variance is equal to 20. 

The following estimators are compared during the experiments. 


• KB-IC-Zeros: This method does not attempt any estimation of the initial 
conditions. It sets their value to 0 (that, when assumption]^ holds, corre¬ 
sponds to the a-priori mean of the vector u-). The kernel hyperparameters 
are obtained solving ( [TI| ), with u- = 0. 


KB-Trunc: This method also avoids the estimation of the initial conditions 
by discarding the first n — 1 output samples, which depend on the unknown 
vector U-. The hyperparameters are obtained solving (111, using the 
truncated data. 


• KB-IC-ModLess: This is the model-less kernel-based estimator presented 
in Section |4T] 

• KB-IC-Mean: This is the conditional mean kernel-based estimator pre¬ 
sented in Section lT2l 

• KB-IC-Joint: This is the joint input-output kernel-based estimator pre¬ 
sented in Section IT^ 




KB-IC-Oracle: This estimator has access to the vector u_, 
the kernel hyperparameters using (111. 


and estimates 


The performances of the estimators are evaluated by means of the fitting 
score, computed as 


FITi = 100 


_ llg» -gi||2 \ 


(34) 


where gi is the impulse response generated at the i-th run, gi its mean and gi 
the estimate computed by the tested methods. 
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N 

150 

200 

250 

300 

400 

KB-IC-Zeros 

51.698 

54.856 

61.151 

61.380 

63.186 

KB-Trunc 

42.010 

51.038 

59.400 

61.085 

62.963 

KB-IC-ModLess 

54.017 

55.793 

61.687 

63.074 

63.466 

KB-IC-Mean 

54.146 

56.003 

62.061 

63.715 

64.187 

KB-IC-Joint 

55.695 

57.133 

62.776 

64.310 

64.457 

KB-IC-Oracle 

57.317 

57.902 

63.781 

64.893 

64.959 


Table 1: Table of experimental results. Shown is the average fit in percent over 
the different experiments. 


5.2 Results 

Table shows the average fit (in percent) of the impulse response over the Monte 
Carlo experiments. We can see that, for short data sequences the amount of 
information discarded by the estimator KB-Trunc makes its performance degrade 
with respect to the other estimators. The estimator KB-IC-Zeros performs better, 
however suffers from the effects of the wrong assumption that the system was 
at rest before the experiment was performed. From these results, we see that 
the estimation of the initial conditions has a positive effect on the accuracy of 
the estimated impulse response. For larger data records the performances of the 
estimator KB-IC-Mean and of the estimator KB-IC-ModLess improve, as more 
samples become available. 

When the available data becomes larger, all the methods perform well, with 
fits that are in the neighborhood of the fit of the oracle. 

The positive performance of KB-IC-Mean indicates that the predictability of 
the input can be exploited to improve estimates, and that model-based approaches 
to initial condition estimation outperforms model-less estimation methods (if the 
input model is known). The further improvement of KB-IC-Joint indicates that 
the output measurements can be used to obtain additional information about 
the unobserved initial conditions, information that is not contained in the input 
process itself. 


6 Discussion 

We have proposed three new methods for estimating the initial conditions of a 
system when using kernel-based methods. Assuming that the input is a stationary 
ARMA process with known spectrum, we have designed mixed MAP-ML criteria 
which aim at jointly estimating the hyperparameters of the kernel and the initial 
conditions of the systems. To solve the related optimization problems, we have 
proposed a novel EM-based iterative scheme. The scheme consists in a sequence 
of simple update rules, given by unconstrained quadratic problems or scalar 
optimization problems. Numerical experiments have shown that the proposed 
methods outperform other standard methods, such as truncation or zero initial 
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conditions. 

The methods presented here estimate n — 1 initial conditions (where n is the 
length of the FIR approximating the true system), since no information on the 
order of the system is given. Assuming that the system order is known and equal 
to say, p, the number of initial conditions to be estimated would boil down to p. 
However, there would also be p unknown transient responses which need to be 
identified. These transients would be characterized by impulse responses with 
the same poles as the overall system impulse response, but with different zeros. 
How to design a kernel correlating these transient responses with the system 
impulse response is still an open problem. 


Appendix: Proofs 


6.1 Theorem [T] 


Consider the ML criterion (12). To apply the EM method, we consider the 
complete log-likelihood 

L{y,g) = -^\\y-Ugr-^loga^ 

- ^9'^{>^Ki3)~^g- ^logdet (XK^). 


where we have introduced g as a latent variable. Suppose that we have computed 
the estimates of the hyperparameters and of the initial conditions. We 
define the function 


Q{p,u-; :=E[L{y, g)] , (35) 

where the expectation is taken with respect to the conditional density p( 5 |y; p^^\ 
defined in (|^. We obtain (neglecting terms independent from the optimization 
variables) 

Q{p,u-\ p^''\u^^^) = Qi{u--, + Q 2 {p-, (36) 

where 

- ^(y^t/_g + tr{(C/^C/_ -2C/^C/+),§W}) , (37) 

Q2(p;p("^wL"^) = -^tr{(AiF;3)-i5W}- ilogdet(AiF;3), (38) 
and = Eg|j, + g g^. Define new parameters from 

= argmax Q(p, u-; (39) 

PfU- 


12 



By iterating between (35) and (39) we obtain a sequence of estimates that 
converges to a maximum of (12) (see e.g., 23 for details). Using (36), (39) splits 
in the two maximization problems 


-(fc+i) _ a,rgmax(5i(w_; , 

U — 

(40) 

p(k+i) _ argmax(52(/3; ■ 

(41) 


Consider the maximization of (37) with respect to . Using the matrix R 
we have 


tr{c/Z’C/_S'('=)} = O In 

tr{t7)rt/+S'('=)} = vec{U+f 0 


Ru_ , 




Rw_ , 


where 0 denotes the Kronecker product. We now collect rt_, and obtain 




1 


(42) 

(43) 

(44) 


2 a^ 

with and 6^^) defined in (16). Hence, (40) is an unconstrained quadratic 
optimization, whose solution is given by (17). 

Consider now the maximization of (38) with respect to p. We can calculate 

(45) 


the derivative of <52 with respect to A, obtaining 


^ = -^trl 
dX 2A2 ^ 


n 


which is equal to zero for 


A- = l(lr{i<--S(‘>}). 


(46) 


We thus have an expression of the optimal value of A as a function of /3. If we 
insert this value in Q 2 , we obtain 


Q2([A*/3]; = 

1 


log (tr I ^ I ^ i log det Kp + ki 


(47) 


where ki is constant. We now rewrite the first order stable spline kernel using 

(48) 


the factorization (see 26 


Kn = . 


where A is defined in (14) and 


Wp := (/3 - /3^)diag 


|l,/3, ..., 


n—2 


on—l 


’ 1-13 


(49) 
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From (151, we find 


Q2([A*/3];p«,wL'=^) = 


I log ^ J; 


W „„-1 


+ ^l0gW/3,*i + k2 : 


(50) 


where and W/s^u are the i-th diagonal elements of and of respectively, 
and k 2 is a constant. If we define the function (211, we can rewrite (501 as (20). 


as (19). 


so that we obtain (20) and (21). Using a similar reasoning, we can rewrite (46) 


□ 


Theorem I 


Consider the ML criterion (27). The proof follows the same arguments as the 


proof of Theorem with the optimization carried out on p only and with 

U- = U_|+. □ 


Theorem |3] 

Consider the ML criterion ( [M| ). Consider the complete data log-likelihood 

i2(?/,d) := logp(2/, 5, u_; p,), (51) 

Given any estimates and , take the expectation with respect to p{g\y, 

We obtain 


E [L2 (p, p)] = Qi (u-; p("), ) + ^2 (p; p("), ) 

- \ U-\ + )^^Zl+{u-- M-I + ) , 


(52) 


with Qi and Q 2 defined in ( [^ and ( [38| . Collecting the terms in m_ and us¬ 
ing (44) we obtain an unconstrained optimization problem in m_, that gives (31). 
The optimization in p follows the same procedure as in Theorem [^and gives ( |32[ ) 
and (li^. 
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